{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "<h1>Creating a searchable Art Database with The MET's open-access collection</h1>"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "In this example, we show how you can enrich data using Cognitive Skills and write to an Azure Search Index using SynapseML. We use a subset of The MET's open-access collection and enrich it by passing it through 'Describe Image' and a custom 'Image Similarity' skill. The results are then written to a searchable index."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "source": [
    "import os, sys, time, json, requests\n",
    "from pyspark.ml import Transformer, Estimator, Pipeline\n",
    "from pyspark.ml.feature import SQLTransformer\n",
    "from pyspark.sql.functions import lit, udf, col, split"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "from pyspark.sql import SparkSession\n",
    "\n",
    "# Bootstrap Spark Session\n",
    "spark = SparkSession.builder.getOrCreate()\n",
    "\n",
    "from synapse.ml.core.platform import running_on_synapse, find_secret\n",
    "\n",
    "if running_on_synapse():\n",
    "    from notebookutils.visualization import display"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "source": [
    "cognitive_key = find_secret(\"cognitive-api-key\")\n",
    "cognitive_loc = \"eastus\"\n",
    "azure_search_key = find_secret(\"azure-search-key\")\n",
    "search_service = \"mmlspark-azure-search\"\n",
    "search_index = \"test\""
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "source": [
    "data = (\n",
    "    spark.read.format(\"csv\")\n",
    "    .option(\"header\", True)\n",
    "    .load(\"wasbs://publicwasb@mmlspark.blob.core.windows.net/metartworks_sample.csv\")\n",
    "    .withColumn(\"searchAction\", lit(\"upload\"))\n",
    "    .withColumn(\"Neighbors\", split(col(\"Neighbors\"), \",\").cast(\"array<string>\"))\n",
    "    .withColumn(\"Tags\", split(col(\"Tags\"), \",\").cast(\"array<string>\"))\n",
    "    .limit(25)\n",
    ")"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworkSamples.png\" width=\"800\" style=\"float: center;\"/>"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "source": [
    "from synapse.ml.cognitive import AnalyzeImage\n",
    "from synapse.ml.stages import SelectColumns\n",
    "\n",
    "# define pipeline\n",
    "describeImage = (\n",
    "    AnalyzeImage()\n",
    "    .setSubscriptionKey(cognitive_key)\n",
    "    .setLocation(cognitive_loc)\n",
    "    .setImageUrlCol(\"PrimaryImageUrl\")\n",
    "    .setOutputCol(\"RawImageDescription\")\n",
    "    .setErrorCol(\"Errors\")\n",
    "    .setVisualFeatures(\n",
    "        [\"Categories\", \"Description\", \"Faces\", \"ImageType\", \"Color\", \"Adult\"]\n",
    "    )\n",
    "    .setConcurrency(5)\n",
    ")\n",
    "\n",
    "df2 = (\n",
    "    describeImage.transform(data)\n",
    "    .select(\"*\", \"RawImageDescription.*\")\n",
    "    .drop(\"Errors\", \"RawImageDescription\")\n",
    ")"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "<img src=\"https://mmlspark.blob.core.windows.net/graphics/CognitiveSearchHyperscale/MetArtworksProcessed.png\" width=\"800\" style=\"float: center;\"/>"
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Before writing the results to a Search Index, you must define a schema which must specify the name, type, and attributes of each field in your index. Refer [Create a basic index in Azure Search](https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index) for more information."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "source": [
    "from synapse.ml.cognitive import *\n",
    "\n",
    "df2.writeToAzureSearch(\n",
    "    subscriptionKey=azure_search_key,\n",
    "    actionCol=\"searchAction\",\n",
    "    serviceName=search_service,\n",
    "    indexName=search_index,\n",
    "    keyCol=\"ObjectID\",\n",
    ")"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": false
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The Search Index can be queried using the [Azure Search REST API](https://docs.microsoft.com/rest/api/searchservice/) by sending GET or POST requests and specifying query parameters that give the criteria for selecting matching documents. For more information on querying refer [Query your Azure Search index using the REST API](https://docs.microsoft.com/en-us/rest/api/searchservice/Search-Documents)"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "source": [
    "url = \"https://{}.search.windows.net/indexes/{}/docs/search?api-version=2019-05-06\".format(\n",
    "    search_service, search_index\n",
    ")\n",
    "requests.post(\n",
    "    url, json={\"search\": \"Glass\"}, headers={\"api-key\": azure_search_key}\n",
    ").json()"
   ],
   "outputs": [],
   "metadata": {
    "collapsed": true
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
